We have been constantly told this statement “Computers don’t lie”. Yes in fact Computers don’t lie, but neither does it speak the truth. A computer does what its Master programs it to do. Similarly, A model wouldn’t lie unless the Machine Learning Engineer doesn’t want it to lie.
There was a nice episode of the podcast You are not so smart came out last year. This is an excerpt from it:
“I want a machine-learning algorithm to learn what tumors looked like in the past, and I want it to become biased toward selecting those kind of tumors in the future,” explains philosopher Shannon Vallor at Santa Clara University. “But I don’t want a machine-learning algorithm to learn what successful engineers and doctors looked like in the past and then become biased toward selecting those kinds of people when sorting and ranking resumes.”
Machine Bias can occur due to a lot of factors but a few to name is:
Below is an example of how Google Translate, when translated the following text to a Gender-neutral langauge and back to English - applies its bias (primarily due to the nature of biased Training Dataset)
img
The first step of finding solution to any problem is accepting The Problem exists. Let’s accept that fact and see how to use Kaggle Survey results and help the community tackle Machine Bias.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(highcharter))
suppressPackageStartupMessages(library(DataExplorer))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(viridis))
plotting_missing <- function(df){
#based on erikbruin's code snippet
NAcol <- which(colSums(is.na(df)) > 0)
NAcount <- sort(colSums(sapply(df[NAcol], is.na)), decreasing = TRUE)
NADF <- data.frame(variable=names(NAcount), missing=NAcount)
NADF$PctMissing <- round(((NADF$missing/nrow(df))*100),1)
NADF %>%
ggplot(aes(x=reorder(variable, PctMissing), y=PctMissing)) +
geom_bar(stat='identity', fill='red') + coord_flip(y=c(0,110)) +
labs(x="", y="Percent missing") +
geom_text(aes(label=paste0(NADF$PctMissing, "%"), hjust=-0.1))
}The above plot is to demonstrate how much these questions that are about Model Fairness / Bias, have been ignored.
While asking about Salary made 15% of respondents to not answer, Questions about Reproducibility, Explainability and Bias made 37% of respondents to skip answering. The salary question comparsion is here to show relatively worse questions like this are approached.
survey %>% select(contains("How do you perceive the importance of the following topics?")) %>%
gather() %>%
mutate(key = str_replace(key,"-","\n")) %>%
mutate(key = str_replace(key,"How do you perceive the importance of the following topics?",""),
key = str_replace(key, regex("\\?"),""),
key = str_replace(key, regex("\\-|\\:"),"")) %>%
group_by(key) %>%
count(value) %>%
drop_na() %>%
mutate(n = n / sum(n)) %>%
ggplot() + geom_col(aes(value,n, fill = key), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = value, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
facet_wrap(~key) +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Perception on Reproducibility, Explainability and Model Bias ",
subtitle = "Percentage",
x = "Selected Options",
y = "Percentage of Respondents (other than NAs)")To get a better perspective of the volume of the respondents, below is the same plot as above but with absolute numbers of respondents and their options.
survey %>% select(contains("How do you perceive the importance of the following topics?")) %>%
gather() %>%
mutate(key = str_replace(key,"-","\n")) %>%
mutate(key = str_replace(key,"How do you perceive the importance of the following topics?",""),
key = str_replace(key, regex("\\?"),""),
key = str_replace(key, regex("\\-|\\:"),"")) %>%
group_by(key) %>%
count(value) %>%
drop_na() %>%
ggplot() + geom_col(aes(value,n, fill = key), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = value, y = n + 20, label = n),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
facet_wrap(~key) +
#scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Perception on Reproducibility, Explainability and Model Bias ",
subtitle = "Absolute Numbers",
x = "Selected Options",
y = "Percentage of Respondents (other than NAs)")Fairness and Bias:
Only close to half (57.4%) of the respondenrts who chose to answer consider Fairness and Bias in ML Algorithm is a Very important.
This is the lowest Very important sentiment echoed by the community of all the 3 questions.
3.6% of those who chose to respondent perceive this is Not at all important, which is the highest Not at all important feeling expressed of all 3 questions.
survey %>% select(contains("How do you perceive the importance of the following topics?")) %>%
gather() %>%
mutate(key = str_replace(key,"How do you perceive the importance of the following topics?",""),
key = str_replace(key, regex("\\?"),""),
key = str_replace(key, regex("\\-"),"")) %>%
group_by(key) %>%
count(value) %>%
mutate(n = percent(n / sum(n))) %>%
spread(value,n) %>%
knitr::kable()| key | No opinion; I do not know | Not at all important | Slightly important | Very important | |
|---|---|---|---|---|---|
| Being able to explain ML model outputs and/or predictions | 2.9% | 1.6% | 17.0% | 41.1% | 37.4% |
| Fairness and bias in ML algorithms: | 5.4% | 2.3% | 19.0% | 36.0% | 37.4% |
| Reproducibility in data science | 3.8% | 1.0% | 14.9% | 42.9% | 37.4% |
They refer to those beings who think Fairness and Bias are Very Important in Machine Learning.
they <- survey %>% filter(`How do you perceive the importance of the following topics? - Fairness and bias in ML algorithms:` == "Very important")
not_they <- survey %>% filter(`How do you perceive the importance of the following topics? - Fairness and bias in ML algorithms:` != "Very important")they %>% group_by(`What is your gender? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Gender" = `What is your gender? - Selected Choice`) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Gender,n, fill = Gender), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = Gender, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Gender",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>% group_by(`What is your gender? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Gender" = `What is your gender? - Selected Choice`) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Gender,n, fill = Gender), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = Gender, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Gender",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)There is a difference of 5.1 PP Female Percentage difference between those who perceive Model Fariness & Bias in ML is Very Important and Others.
While this could be seen as that Female Gender usually gets affected by these Biases, It’s also important to realize that Male Gender (Kaggler’s) don’t echo similar sentiment as their female counterpart. After all, A healthy model is what we all want, don’t we?
they %>% group_by(`What is your age (# years)?`) %>% count() %>% ungroup() %>%
rename("Age" = `What is your age (# years)?`) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Age,n, fill = Age), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = Age, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "D") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Age",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>% group_by(`What is your age (# years)?`) %>% count() %>% ungroup() %>%
rename("Age" = `What is your age (# years)?`) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Age,n, fill = Age), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = Age, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "D") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Age",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)Age doesn’t seem to give anything straightway, which probably could be due to a lot of different age brackets. Let us try a bit of engineering to club them into two groups < 30 and > 30.
they %>%
mutate(age_grp = ifelse(parse_number(`What is your age (# years)?`) < 30,
"Less than 30",
"30+")) %>%
group_by(age_grp) %>% count() %>% ungroup() %>%
rename("Age" = age_grp) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Age,n, fill = Age), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = Age, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Age",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>%
mutate(age_grp = ifelse(parse_number(`What is your age (# years)?`) < 30,
"Less than 30",
"30+")) %>%
group_by(age_grp) %>% count() %>% ungroup() %>%
rename("Age" = age_grp) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Age,n, fill = Age), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = Age, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Age",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)This plot helps us say that the younger ones need to be updated with the implications of Model Bias and Fairness more than their older counterparts. That leads us to another important section of what they do.
they %>%
mutate(title = ifelse(`Select the title most similar to your current role (or most recent title if retired): - Selected Choice` == "Student",
"Student",
"Professional")) %>%
group_by(title) %>% count() %>% ungroup() %>%
rename("Title" = title) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Title,n, fill = Title), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = Title, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "C") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Title",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>%
mutate(title = ifelse(`Select the title most similar to your current role (or most recent title if retired): - Selected Choice` == "Student",
"Student",
"Professional")) %>%
group_by(title) %>% count() %>% ungroup() %>%
rename("Title" = title) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Title,n, fill = Title), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = Title, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "C") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Title",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)they %>%
mutate(UG = ifelse(`Which best describes your undergraduate major? - Selected Choice` %in% c("Computer science (software engineering, etc.)","Information technology, networking, or system administration"),
"CS",
"Non_CS")) %>%
group_by(UG) %>% count() %>% ungroup() %>%
rename("UG" = UG) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(UG,n, fill = UG), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = UG, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "D") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Title",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>%
mutate(UG = ifelse(`Which best describes your undergraduate major? - Selected Choice` %in% c("Computer science (software engineering, etc.)","Information technology, networking, or system administration"),
"CS",
"Non_CS")) %>%
group_by(UG) %>% count() %>% ungroup() %>%
rename("UG" = UG) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(UG,n, fill = UG), stat = "identity", show.legend = FALSE) +
geom_label(aes(x = UG, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "D") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Title",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`What specific programming language do you use most often? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Lang" = `What specific programming language do you use most often? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`What specific programming language do you use most often? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Lang1" = `What specific programming language do you use most often? - Selected Choice`,
"Not They" = n) %>%
select(-Lang1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("bar",hcaes("Lang","T_NT_Ratio")) %>%
hc_title(text = "Language ordered by They-NotThey Ratio") %>%
hc_add_theme(hc_theme_538()) they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In which country do you currently reside?`) %>% count() %>% ungroup() %>%
rename("Country" = `In which country do you currently reside?`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In which country do you currently reside?`) %>% count() %>% ungroup() %>%
rename("Country1" = `In which country do you currently reside?`,
"Not They" = n) %>%
select(-Country1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("Country","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Country-wise") %>%
hc_add_theme(hc_theme_538())they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In which country do you currently reside?`) %>% count() %>% ungroup() %>%
rename("Country" = `In which country do you currently reside?`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In which country do you currently reside?`) %>% count() %>% ungroup() %>%
rename("Country1" = `In which country do you currently reside?`,
"Not They" = n) %>%
select(-Country1)) %>%
filter((They + `Not They`) > 100) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("Country","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Country-wise > 100 respondents") %>%
hc_add_theme(hc_theme_538())they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Industry" = `In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Industry1" = `In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`,
"Not They" = n) %>%
select(-Industry1)) %>%
#filter((They + `Not They`) > 100) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
filter(!Industry %in% c("3","Other")) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("Industry","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Industry-wise") %>%
hc_add_theme(hc_theme_538())Kagglers in Industries like Non_Profit/Service and Government/Public Service have been better perception about the importance of Model Fairness and Bias.
It’s also unhealthy to see places like Military and Internet-based Services falling behind as those are the places where the model evaluation is crucial and can have serious outcomes.
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`Do you consider yourself to be a data scientist?`) %>% count() %>% ungroup() %>%
rename("DS" = `Do you consider yourself to be a data scientist?`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`Do you consider yourself to be a data scientist?`) %>% count() %>% ungroup() %>%
rename("DS1" = `Do you consider yourself to be a data scientist?`,
"Not They" = n)# %>%
# select(-DS1)
) %>%
#filter((They + `Not They`) > 100) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
filter(!DS %in% c("3","Other")) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("DS","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Based on if they consider themsevles a Data Scientist") %>%
hc_add_theme(hc_theme_538())they %>%
dplyr::select(contains("types of data")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:11, key = "questions", value = "DataType") %>%
#tidyr::replace_na() %>%
group_by(DataType) %>%
count() %>%
rename("They" = n) %>%
bind_cols(
not_they %>%
dplyr::select(contains("types of data")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:11, key = "questions", value = "DataType") %>%
#tidyr::replace_na() %>%
group_by(DataType) %>%
count() %>%
rename("Not They" = n)
) %>%
dplyr::select(-DataType1) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("DataType","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Type of Data") %>%
hc_add_theme(hc_theme_538())